import pandas as pd
admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission.head()

| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.00 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
EBA3500 Data analysis with programming
Jonas Moss
BI Norwegian Business School
Department of Data Science and Analytics
We’ll work with an example from this page about admission to colleges.
This dataset has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre, gpa and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
Whoops! Doesn’t work due to a bug in matplotlib, I believe.
array([[ 1. , 0.18443428, 0.17821225, -0.24251318],
[ 0.18443428, 1. , 0.38426588, -0.12344707],
[ 0.17821225, 0.38426588, 1. , -0.05746077],
[-0.24251318, -0.12344707, -0.05746077, 1. ]])
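The correlation matrix above can be reproduced directly from the data frame; a minimal sketch, assuming the UCLA URL is still reachable:

```python
import pandas as pd

# load the admissions data (same URL as above) and compute pairwise correlations
admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
corr = admission.corr().to_numpy()
print(corr.round(3))
```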
Is admit affected by gre? lmplot imposes the simple linear regression curve on the data.
The line is the predicted probability of admission; the dots are \(1\) if admitted, \(0\) if not.
Is admit affected by gre per level of rank? The third plot indicates the problem: the line may predict an impossible probability!
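Concretely, the least-squares line through the toy 0/1 data used below already overshoots \(1\); a quick sketch with numpy:

```python
import numpy as np

x = np.linspace(0, 1, 11)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1])

# slope and intercept of the least-squares line through the 0/1 outcomes
b1, b0 = np.polyfit(x, y, 1)

# the fitted "probability" at x = 1 already exceeds 1
print(b0 + b1 * 1.0)
```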
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 11)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1])
sns.lmplot(x="x", y="y", data=pd.DataFrame({'x': x, 'y': y}))
plt.show()

| Name | Link function \(g\) | \(g^{-1} = F\) | Where? |
|---|---|---|---|
| Logistic / logit | \(\log\left(\frac{x}{1-x}\right)\) | \(\frac{1}{1+e^{-x}}\) | Machine learning, statistics |
| Probit | \(\Phi^{-1}\), the normal quantile function | \(\Phi\), the normal CDF | Economics, statistics |
| Cauchit | \(\tan(\pi(x-0.5))\) | \(\frac{1}{\pi}\arctan(x)+\frac{1}{2}\) | Used for heavy-tailed data |
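Each link and its inverse in the table can be checked numerically; a quick sanity check with scipy:

```python
import numpy as np
from scipy import stats

p = np.linspace(0.01, 0.99, 50)

# logit link; its inverse is the logistic CDF
logit_ok = np.allclose(stats.logistic.cdf(np.log(p / (1 - p))), p)

# probit link (normal quantile); its inverse is the normal CDF
probit_ok = np.allclose(stats.norm.cdf(stats.norm.ppf(p)), p)

# cauchit link; its inverse is arctan(x)/pi + 1/2
cauchit_ok = np.allclose(np.arctan(np.tan(np.pi * (p - 0.5))) / np.pi + 0.5, p)

print(logit_ok, probit_ok, cauchit_ok)
```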
curve_fit

We can use scipy.optimize.curve_fit to fit binary regression models by minimizing the nonlinear least squares criterion \[\min_\beta \sum_{i=1}^{n}(y_{i}-F(x_i^T\beta))^{2}.\]
from scipy.optimize import curve_fit
from scipy import stats
x = np.linspace(0, 1, 11)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1])
func_logistic = lambda x, a, b: stats.logistic.cdf(a + b * x)
param_logistic = curve_fit(func_logistic, x, y)[0]
func_norm = lambda x, a, b: stats.norm.cdf(a + b * x)
param_norm = curve_fit(func_norm, x, y)[0]
#plt.scatter(x, y)
x_ = np.linspace(0, 1, 100)
sns.lmplot(x="x", y="y", data=pd.DataFrame({'x':x, 'y':y}))
plt.plot(x_, func_norm(x_, param_norm[0], param_norm[1]), color = "red")
plt.plot(x_, func_logistic(x_, param_logistic[0], param_logistic[1]), color = "black")
plt.show()

statsmodels

import statsmodels.formula.api as smf
import statsmodels.api as sm
probit = smf.probit("admit ~ gre", data=admission).fit()
logit = smf.logit("admit ~ gre", data=admission).fit()

Optimization terminated successfully.
Current function value: 0.607481
Iterations 5
Optimization terminated successfully.
Current function value: 0.607570
Iterations 5
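The logit and probit fits above land on nearly identical log-likelihoods; their coefficients differ roughly by the standard deviation of the logistic distribution, \(\pi/\sqrt{3}\approx 1.8\). A sketch on simulated data (the variable names and true coefficients are my own, not from the lecture):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)

# latent-variable model with logistic errors
z = -0.5 + 1.2 * x + rng.logistic(size=n)
df = pd.DataFrame({"x": x, "y": (z >= 0).astype(int)})

logit = smf.logit("y ~ x", data=df).fit(disp=0)
probit = smf.probit("y ~ x", data=df).fit(disp=0)

# logit coefficients are roughly 1.6-1.8 times the probit coefficients
ratio = logit.params["x"] / probit.params["x"]
print(round(ratio, 2))
```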
mod_logistic = smf.glm(formula="admit ~ gre", data=admission, family=sm.families.Binomial()).fit()
print(mod_logistic.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: admit No. Observations: 400
Model: GLM Df Residuals: 398
Model Family: Binomial Df Model: 1
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -243.03
Date: Thu, 17 Nov 2022 Deviance: 486.06
Time: 15:05:32 Pearson chi2: 399.
No. Iterations: 4 Pseudo R-squ. (CS): 0.03420
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -2.9013 0.606 -4.787 0.000 -4.089 -1.714
gre 0.0036 0.001 3.633 0.000 0.002 0.006
==============================================================================
Intercept 0.901014
x 1.891271
dtype: float64
| Error (\(\epsilon\)) | Link function | Model name |
|---|---|---|
| \(N(0, \sigma)\) | \(\Phi^{-1}(x)\) | Probit model |
| \(\operatorname{Logistic}(0, \sigma)\) | \(\log\frac{x}{1-x}\) | Logistic model |
| \(\operatorname{Cauchy}(0, \sigma)\) | \(\tan{[\pi(x-0.5)]}\) | Cauchit |
\(\Phi^{-1}(x)\) is the quantile function for the normal distribution. All of these distributions are symmetric, that is, \(f(x) = f(-x)\).
Suppose \(Z_{i}=\beta^{T}X_{i}+\epsilon_{i}\) with \(\epsilon_{i}\sim F\) and define \(Y_{i}=1[Z_{i}\geq0]\).
Then \[\begin{eqnarray*} P(Y_{i}=1\mid X_{i}) & = & P(Z_{i}\geq0\mid X_{i}),\\ & = & P(\beta^{T}X_{i}+\epsilon_{i}\geq0\mid X_{i}),\\ & = & P(\epsilon_{i}\geq-\beta^{T}X_{i}\mid X_{i}),\\ & = & 1-F(-\beta^{T}X_{i}). \end{eqnarray*}\]
Thus \(g^{-1}(\beta^{T}X_{i}) = 1-F(-\beta^{T}X_{i})\).
If \(F\) is symmetric around \(0\), i.e., \(f(x)=f(-x)\), then \(1-F(-x)=F(x)\), hence \(g^{-1} = F\).
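The derivation can be verified by simulation: draw the latent \(Z_i\) with logistic errors and compare the empirical \(P(Y_i=1)\) with \(F(\beta^{T}X_i)\). The coefficient values here are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta0, beta1 = -0.3, 0.8
x0 = 1.0  # a fixed covariate value

# latent variable Z = beta^T X + eps with logistic errors
eps = rng.logistic(size=200_000)
y = (beta0 + beta1 * x0 + eps >= 0)

# by symmetry, P(Y = 1 | x0) = 1 - F(-beta^T x) = F(beta^T x)
empirical = y.mean()
theoretical = stats.logistic.cdf(beta0 + beta1 * x0)
print(round(empirical, 3), round(theoretical, 3))
```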
We simulate from a normal with scale = 5:
Optimization terminated successfully.
Current function value: 0.681655
Iterations 4
Intercept 0.051516
x 0.317042
dtype: float64
And we fit a model with scale = 1 too:
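Only the ratio \(\beta/\sigma\) is identified: scaling the coefficients and the error scale by the same constant leaves the distribution of the data unchanged, so the scale-1 fit recovers \(\beta/\sigma\). A sketch (the true values below are my own choices, not the lecture's simulation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(size=n)

# probit-type data with error scale 5 and true coefficients (0.5, 1.5)
beta0, beta1, scale = 0.5, 1.5, 5.0
y = (beta0 + beta1 * x + rng.normal(scale=scale, size=n) >= 0).astype(int)

fit = smf.probit("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit(disp=0)

# the fit (which assumes scale 1) recovers beta / scale = (0.1, 0.3)
print(fit.params.round(2))
```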
How much do you like Coca-Cola? Answer on a scale from \(-5\) to \(5\).
Such questions have high cognitive load! They are hard to answer.
Do you like Coca-Cola?
Easy to answer! Just say yes or no.
Optimization terminated successfully.
Current function value: 0.574351
Iterations 5
To interpret the coefficients we must exponentiate them.
Hence increasing gpa by 1 multiplies the odds of being admitted by about \(1.6\), an increase of \(60\%\), everything else held equal. Increasing the rank of the university by one decreases the odds by \(30\%\), though.
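As a numeric check: a logit coefficient \(b\) multiplies the odds by \(e^{b}\) per unit increase. The coefficient values below are assumed for illustration, chosen to match the percentages in the text (the fitted model itself is not shown here):

```python
import numpy as np

# assumed logit coefficients, chosen to match the percentages in the text
coef_gpa, coef_rank = 0.47, -0.36

print(np.exp(coef_gpa))   # about 1.60: odds of admission up ~60% per gpa point
print(np.exp(coef_rank))  # about 0.70: odds down ~30% per rank step
```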
Logit and probit links are easy to fit with statsmodels, but other link functions require more work.